On measuring and correcting the effects of data mining and model selection
Author
Abstract
In the theory of linear models, the concept of degrees of freedom plays an important role. This concept is often used for measurement of model complexity, for obtaining an unbiased estimate of the error variance, and for comparison of different models. I have developed a concept of generalized degrees of freedom (GDF) that is applicable to complex modeling procedures. The definition is based on the sum of the sensitivity of each fitted value to perturbation in the corresponding observed value. The concept is nonasymptotic in nature and does not require analytic knowledge of the modeling procedures. The concept of GDF offers a unified framework under which complex and highly irregular modeling procedures can be analyzed in the same way as classical linear models. By using this framework, many difficult problems can be solved easily. For example, one can now measure the number of observations used in a variable selection process. Different modeling procedures, such as a tree-based regression and a projection pursuit regression, can be compared on the basis of their residual sums of squares and the GDF that they cost. I apply the proposed framework to measure the effect of variable selection in linear models, leading to corrections of selection bias in various goodness-of-fit statistics. The theory also has interesting implications for the effect of general model searching by a human modeler.
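The definition of GDF as the summed sensitivity of each fitted value to a perturbation in its own observed value can be illustrated numerically. The sketch below is one possible Monte Carlo approximation, not necessarily the procedure used in the paper: it adds small Gaussian noise to the observations, refits, and takes the per-observation regression slope of fitted value on perturbation. The function names, the perturbation scale `tau`, and the number of perturbations `n_perturb` are illustrative assumptions. For ordinary least squares the estimate should land near the number of fitted coefficients. For a procedure that first selects a predictor and then fits it, the estimate shows the extra cost of the search.

```python
import numpy as np


def estimate_gdf(fit_predict, y, tau=0.5, n_perturb=400, seed=0):
    """Monte Carlo sketch of generalized degrees of freedom.

    fit_predict: callable mapping a response vector to fitted values
                 (hypothetical interface; any modeling procedure fits here).
    tau, n_perturb: assumed tuning values, not taken from the source.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n = y.shape[0]

    # Small random perturbations of the observed responses.
    deltas = rng.normal(0.0, tau, size=(n_perturb, n))
    fitted = np.empty((n_perturb, n))
    for t in range(n_perturb):
        fitted[t] = fit_predict(y + deltas[t])

    # Sensitivity of fitted value i to observation i: the slope of
    # fitted[:, i] regressed on deltas[:, i] across perturbations.
    d = deltas - deltas.mean(axis=0)
    f = fitted - fitted.mean(axis=0)
    slopes = (d * f).sum(axis=0) / (d ** 2).sum(axis=0)

    # GDF is the sum of the per-observation sensitivities.
    return float(slopes.sum())


# Synthetic data (purely illustrative).
rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # 3 coefficients
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)


def ols_fit(y_obs):
    """Ordinary least squares on the fixed design X."""
    coef = np.linalg.lstsq(X, y_obs, rcond=None)[0]
    return X @ coef


Z = rng.normal(size=(n, 10))  # candidate predictors for the selection step


def select_one_and_fit(y_obs):
    """Pick the single candidate with the smallest RSS, then fit it."""
    best_fit, best_rss = None, np.inf
    for j in range(Z.shape[1]):
        D = np.column_stack([np.ones(n), Z[:, j]])
        fit = D @ np.linalg.lstsq(D, y_obs, rcond=None)[0]
        rss = float(np.sum((y_obs - fit) ** 2))
        if rss < best_rss:
            best_fit, best_rss = fit, rss
    return best_fit


gdf_ols = estimate_gdf(ols_fit, y)
gdf_select = estimate_gdf(select_one_and_fit, y)
print(f"GDF, OLS with 3 coefficients: {gdf_ols:.2f}")
print(f"GDF, select 1 of 10 then fit: {gdf_select:.2f}")
```

Because `estimate_gdf` only calls the procedure as a black box, the same function can in principle be pointed at a tree or projection pursuit fit. The estimate is noisy, so `tau` and `n_perturb` would need tuning for any real use.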
Similar resources
A Hybrid DEA Based CHAID and Imperialist Competitive Algorithm for Stock Selection
In this paper, an investment portfolio is first formed using the CHAID data mining algorithm on the basis of risk status criteria. In the next step, a second investment portfolio is created from the decision rules extracted by the DEA-BCC model. The final portfolio is created through a two-objective mathematical programming model based on the Imperialist Competitive Algorithm.
A Novel Method for Selecting the Supplier Based on Association Rule Mining
One of the important problems in supply chain management is supplier selection. In a company, there are massive data from various departments, so extracting knowledge from the company's data is very complicated. Many researchers have addressed this problem with methods such as fuzzy set theory, goal programming, multi-objective programming, linear programming, mixed integer programming, analyti...
A new model for mining method selection based on grey and TODIM methods
One of the most important steps involved in mining operations is to select an appropriate extraction method for mine resources. After choosing the extraction method, it is usually impossible to replace it with another one because it may be so expensive that implementation of the entire project could be economically impossible. Choosing a mining method depends on the geological and geometrical c...
Developing a Course Recommender by Combining Clustering and Fuzzy Association Rules
Each semester, students go through the process of selecting appropriate courses. It is difficult to find information about each course and ultimately make decisions. The objective of this paper is to design a course recommender model which takes student characteristics into account to recommend appropriate courses. The model uses clustering to identify students with similar interests and skills...
H-BwoaSvm: A Hybrid Model for Classification and Feature Selection of Mammography Screening Behavior Data
Breast cancer is one of the most common cancers in the world. Early detection of cancer significantly reduces morbidity rates and treatment costs. Mammography is a known effective method for diagnosing breast cancer. One way to identify mammography screening behavior is to evaluate women's awareness of participating in mammography screening programs. Today, intelligent systems could...
Bridging the semantic gap for software effort estimation by hierarchical feature selection techniques
Software project management is one of the significant activities in the software development process. Software Development Effort Estimation (SDEE) is a challenging task in software project management. SDEE has been practiced in the computer industry since the 1940s and has been reviewed several times. An SDEE model is appropriate if it provides the accuracy and confidence simultaneously before softwa...
Publication date: 1999